Isolation, complete genome sequencing and in silico genome mining of Burkholderia for secondary metabolites

Recent years, Burkholderia species have emerged as a new source of natural products (NPs) with increasing attractions. Genome mining suggests the Burkholderia genomes include many natural product biosynthetic gene clusters (BGCs) which are new targets for drug discovery. In order to collect more Burkholderia, here, a strain S-53 was isolated from the soil samples on a mountain area in Changde, P.R. China and verified by comparative genetic analysis to belong to Burkholderia. The complete genome of Burkholderia strain S-53 is 8.2 Mbps in size with an average G + C content of 66.35%. Its taxonomy was both characterized by 16S rRNA- and whole genome-based phylogenetic trees. Bioinformatic prediction in silico revealed it has a total of 15 NP BGCs, some of which may encode unknown products. It is expectable that availability of these BGCs will speed up the identification of new secondary metabolites from Burkholderia and help us understand how sophisticated BGC regulation works.


Introduction
The prevalence of drug-resistant pathogens has been a serious problem and effected the human life and agriculture. The World Health Organization (WHO) estimates ten million deaths by 2050 if multi-drug resistant (MDR) infections are not appropriately managed [1,2]. All major antibiotic classes have been found to have antimicrobial resistance, and the number of candidates for novel antibiotics is dwindling. Hence screening novel antibacterial compounds is critical for new drug discovery [3].
Microbial natural products are the important sources of drug discovery because of their structural diversity to make up more than 75% of antibiotics [4,5]. The 99.99% rediscovery rate in traditional discovery pipelines of natural products is a big drawback [6]. However, the last decade has been a revival time for natural product discovery which was fueled by advances in analytical chemistry, bioinformatics, and whole genome sequencing [7].
Microbial genome sequencing revealed that they contain huge sources of cryptic BGCs, which have a larger capability to produce secondary metabolites. Availability of whole genome sequences and synthetic biologyinspired tools/approaches make it possible to utilize these BGCs to develop new chemicals with new structures, new activity and new targets [8].
Modern natural product discovery relies on, to a higher extent, on the microbial genome sequencing and computer mining for BGCs. Next stages include selecting unique BGCs, cloning and expressing selected BGCs in an optimal heterologous host or activating in situ silent *Correspondence: liruijuan@sdu.edu.cn; ayli@sdu.edu.cn 1 Helmholtz International Lab for Anti-Infectives, State Key Laboratory of Microbial Technology, Shandong University-Helmholtz Institute of Biotechnology, Shandong University, Qingdao 266237, People's Republic of China Full list of author information is available at the end of the article BGCs. This pipeline (genome mining of NPs) takes less time on dereplication and streamlines NP discovery via the use of advanced computational, microbiological and synthetic biological approaches, to more extents, compared to traditional screening methods.
Most members of Burkholderia are well-known as pathogens to their hosts (plants or human) and now 44 members in this genus have been identified [9]. In recent years, many species of Burkholderia have been found to have the ability to excrete a range of secondary metabolites, including antibacterial, anticancer, herbicidal, and insecticidal chemicals that can act as bioremediation, biocontrol and plant growth promotion agents [10,11].
More recently, the increasing data of Burkholderia genome sequences have shown a vast reservoir of NPs, such as non-ribosomal peptides (NRPs) and polyketides (PKs), with various pharmacological functions [12]. Many silent BGCs in Burkholderia genomes remain unexplored as potential drug development targets. Using genome mining approaches, many compounds, such as bolagladins/glidochelins, gladiofungins, thailandepsins/ burkholdacs, romidepsin (FK228) and so on, were discovered from Burkholderia [13].
Due to the restrictive growth conditions, only a limited number of Burkholderia species have been isolated and identified as having NP BGCs or NP producers. Thus, the isolation of more species in Burkholderia from various environments and high-quality sequencing of Burkholderia genomes still are necessary for multi-omics research, which aids in the understanding of BGC regulation and rationally designing biosynthetic pathways of NPs [14,15].
The purpose of this research was to determine the potential of a Burkholderia strain S-53 obtained from a small mountain area, which showed a quicker growth rate among three species of Burkholderia. Its genome was sequenced and analyzed for the presence of putative NP BGCs. Our data revealed this strain contains a substantial number of BGCs, indicating that its potential capability of producing new chemicals with biological activity.

Isolation and characterization of Burkholderia
We collected soil samples from a small mountain (location: Tiesi Gang in Zoushi Town, 29.12755 N,111.564903E) in Changde City, Hunan Province, P.R. China using sterilized spoons. Soil samples were pretreated by drying at room temperature and then soaked in PBS buffer (10 mL PBS/g soil). Pretreated samples were serially diluted in PBS and seeded onto solid CYMG (8 g/l Casein peptone, 4 g/L Yeast extracts, 4.06 g/L MgCl 2 ·2H 2 O, and 10 ml/L 50% Glycerin) medium, then cultivated at 28 ℃ for 2 days. Whitish colonies were analyzed by colony PCR for 16S rRNA amplification with universal primers 27F(5'-AGA GTT TGA TCC TGG CTC AG-3) and 1492R (5'-TAC GAC TTA ACC CCA ATC GC) under the standard PCR conditions (95℃ for 5 min, then 30 cycles of 94℃ for 1 min, 55-58℃ for 1 min and 72℃ for 90 s), followed by sequencing in Sangon Biotech (Shanghai) to pick out Burkholderia species. Morphological features of S-53 were recorded when cultivating on CYMG agar plates and molecular taxonomic approaches via TrueBac ™ ID system and Type Strain Genome Server (TYGS)) were used to characterize the resultant isolates.

Measurement of the growth curve of S-53
S-53 was inoculated into CYMG microwell plates (400 μ L CYMG broth in each well) using 15 wells as parallel groups, and cultivated at 30℃ for 30 h. During cultivation, OD 600 values for each well were recorded once at 1 h interval. Taken the OD 600 value of each parallel well at the 0 h as the blank control, the difference (OD 600/n-h -OD 600/0-h ) between the OD 600 at each time-point (n-h) and OD 600 at 0 h was calculated to represent the growth of S-53. Using the average values of OD 600/n-h -OD 600/0h as Y-axis and time-point per h as X-axis, the growth curve of S-53 was obtained.

Extraction of high molecular weight genomic DNA
Burkholderia strain S-53 was inoculated into 50 mL of CYMG liquid culture medium with glass beads (3 ± 0.3 mm diameter) in a 250 mL baffled flask and cultured for 24 h at 30 °C in a 200-rpm orbital shaker. To extract genomic DNA (gDNA), 50 mL cultivated cells were collected during the exponential growth phase and washed twice with the same amount of 10 mM EDTA followed by 45 min at 37 °C with lysozyme (10 mg /mL). gDNA for gram negative bacteria was extracted using TIANamp Bacterial DNA kit from Tiangen Biochemical Technology (Beijing) Co., Ltd, according to the instructions from the manufacturer. We determined the quality and amount of extracted gDNA samples using 1% agarose gel electrophoresis on Nanodrop (Thermo Fisher Scientific, Waltham, MA, USA).

de novo Genome sequencing, assembly and annotation
To get fine sequence data, gDNA of S-53 was submitted to GENEWIZ Biotechnology Co., Ltd in Tianjin, China for genome sequencing with two methods: For Illumina sequencing, firstly, DNA was fragmented into around 500 bp, repaired for blunt ends, and then modified with the base "A" through the 3' end, so that the DNA fragments can be connected to the linker with the "T" base at the 3' end. The target fragment ligation product is recovered, and then PCR is used to amplify the DNA fragments with adapters at both ends, and finally the qualified library is used for cluster preparation and sequencing. For PacBio sequencing, 5-10 μg genomic DNA was sheared into 10-15 kb fragments using a g-TUBE device. Then library was constructed using the SMRTbell ® Express Template Preparation Kit 2.0. The PCR products obtained using library DNA as templates were cleaned up and validated using an Agilent 2100 Bioanalyzer. Next, the qualified libraries were sequenced with pair-end PE150 on the illumina Hise-qXten/Novaseq/MGI2000 System or on Sequel II sequencing platform.
Finding coding genes was conducted using the Prodigal [22]/Augustus [23] gene-finding software while detection of transfer RNAs (tRNAs) was done using the program tRNAscan-SE [24] with default parameter settings. rRNAs were identified by using Barrnap. Other RNAs were identified by rfam database. By BLAST using National Center for Biotechnology Information (NCBI) NR database, the coding genes were annotated (screening conditions were displayed in Table 1).
GO [25] (Gene Ontology) database and KEGG [26] (Kyoto Encyclopedia of Genes and Genomes) database were used for analyzing functions of genes and annotating the pathways. The database of COG/KOG [14] (Clusters of Orthologous Groups) was used for phylogenetic classification of proteins.

Phylogenetic analysis
Two methods were used for phylogenetic analysis of S-53: (i) Whole genome-based taxonomic analysis was conducted using the Genome BLAST Distance Phylogeny approach (GBDP) by uploading genome sequence data to the Type (Strain) Genome Server (TYGS), a free bioinformatics platform accessible at https://tygs.dsmz.de [27]. (ii) A phylogenetic tree was constructed based on the 16S rRNA gene sequence of the Burkholderia strain S-53 and those extracted from the list of hits from EzBio-Cloud 16S database [28]. Evolutionary trees were established with maximum-likelihood methods [29] in MEGA X package [30]. The confidence of the tree topologies was assessed by 100 bootstrap replicates.

Whole genome sequences for bacterial identification
Bacterial identification utilizing whole genome sequences was conducted on the TrueBac ™ ID technology, a cloudbased service [31] to reveal the genuine identification of bacterial isolates using a multitude of methods.

Comparative genomic studies/whole genome relatedness
For a whole genome-based taxonomic analysis, the genome sequence data were uploaded to the Type (Strain) Genome Server (TYGS), a free bioinformatics platform accessible at https:// tygs. dsmz. de (accessed 28 December 2021). The Genome BLAST Distance Phylogeny approach (GBDP) was used to calculate dDDH (digital DNA-DNA hybridization) values and construct minimum evolution trees using TYGS [32,27]. MEGA-X [30] was used to visualize GBDP trees. The ANI/AAI-Matrix calculator was used to calculate the average nucleotide identity (ANI) [33,34]. The average amino acid identity (AAI) and average nucleotide identity (ANI) matrices of all conserved genes in the core genome were computed by the BLAST algorithm and visualized as heat maps for a more in-depth qualitative comparison between the genomes. Using EZBIOCLOUD, the average nucleotide identity (ANI) of the assembled genome nucleotide files was calculated against the whole genome sequences of the strains used for 16S rRNA sequence analysis [35]. This method computes nucleotide identity through pairwise sequence alignment, yielding an overall average similarity of the genomes that is independent of sequence length.
The CGView (http:// cgview. ca/) was used to generate a graphical representation of the BLAST result comparison of the available genomes to the genome of Burkholderia strain S-53.

Secondary metabolite biosynthetic gene cluster prediction
As a main approach for finding and annotating genes in BGCs across the genome, antiSMASH 6 [36] combined with ClusterBlast, ActiveSiteFinder, ClusterBlast, Cluster PFam analysis, SubClusterBlast, PRISM 4 and BAGEL 4 [36] was used for discovery of BGCs in the genome of S-53 for secondary metabolites.
Particularly, BAGEL 4 was used to mine BGCs for RiPPs and bacteriocin, whereas PRISM 4 was designed for structural prediction of secondary metabolites [37]. Several database systems, including the principles of hidden Markov model (HMM) [38], BLAST algorithm [39], PFAM [40], GenBank [41], UniprotKB [42], bactibase [43], To indicate genome sizes inside and outside, the ruler was used in the chromosome map CAMPR3 [44], and the MiBig data repository [45] were used for BGC annotation. As well, NapDos was used [46] to look for KS and C domains in these genomic sequences.

Morphological and microscopic examination and phylogenetic analysis of 16S rRNA
In order to isolate more species of Burkholderia from the soil samples, we incubated a serial of isolates on CYMG medium at 30℃, followed by colony PCR amplification for 16S rRNA gene. Next by 16S rRNA-based phylogenetic analysis, 3 isolates were identified as Burkholderia, representing different species: S-53 shared the highest gene identity of 16S rDNA (99.93%) with the type strain Burkholderia stabilis (NCBI Blastn). The S-53 colonies on CYMG medium were recorded (Fig. 1). The partial 16S rDNA gene sequence of the S-53 strain, 1337 bps in length, was deposited in the GenBank nucleotide database with an accession number of OM019084.
Among three strains we isolated, we found S-53 grows more rapidly (much shorter than 18 h into its    (Fig. 1b). Because a higher growth rate is an important feature for species of Burkholderia for expressing of NPs, we chose S-53 for next genome sequencing.

Genomic features of Burkholderia strain S-53
The genome of Burkholderia strain S-53 is 8.254 Mbps in length and composed of 7239 protein-encoding genes, 63 tRNA genes, 18 rRNA genes and 72 ncRNA genes (Table 1 and Table 2). Figure 2 showed a circular chromosome based S-53 genome sequence using CG View server (http:// cgview. ca/), which is a web-based tool for comparative genomics analysis on circular genomes [47].
The genome sequence of the Burkholderia strain S-53 has been deposited at GenBank under the GenBank accession CP090482-CP090484.

Bacterial strain identification by whole genome sequence and comparative genome analysis of S-53
Here, using TrueBac ™ ID system [31] for bacterial identification based on whole genome sequence of S-53 strain, it could be identified as Burkholderia pyrrocinia (Table 3).
Further, we performed comparative genome analysis of Burkholderia S-53 (Table 4): the pairwise comparison of Burkholderia strain S-53 was recorded from TYGS [27] which is a fast-increasing discipline of genome-based taxonomy descriptions of new genera, species, and subspecies (https:// tygs. dsmz. de/).

Phylogenetic analysis via GBDP method
Using Genome BLAST Distance Phylogeny (GBDP) method and tree builder service, the phylogeny tree of Burkholderia strain S-53 using its whole genome sequence was created while FastME was used to estimate the tree using GBDP intergenomic distances derived from complete proteomes. GBDP phylogenetic tree constructed by using 16S rRNA indicated that S-53 is similar to Burkholderia pyrrocinia DSM10685K (Fig. 3a). On the other hand, GBDP phylogenetic tree constructed by using whole genome indicated that S-53 is similar to B. stabilis ATCCBAA-67 (Fig. 3b). In all, 16S rRNA-based GBDP phylogenetic tree and whole genome alignment and comparative genome analysis suggested it to be Burkholderia pyrrocinia, while GBDP phylogenetic tree constructed by using whole genome supported it to be B. stabilis. Combining these analyses, we concluded it to be closer to Burkholderia pyrrocinia.
On the other hand, the phylogenetic tree was constructed from EzBioCloud 16S database by maximumlikelihood methods by Mega X application with 100 bootstrap values depicted in the Fig. 4. According to the maximum likelihood method, S-53 is close to the Burkholderia stabilis ATCC BAA-67 and Burkholderia pyrrocinia DSM 10,685.

Fig. 4
Evolutionary analysis of S-53 using the Maximum Likelihood method. The proportion of phylogenetic trees with the same taxonomy is given next to the branches. For the heuristic search, we used Neighbor-Join and BioNJ algorithms on a matrice of pairwise distances evaluated using Maximum Composite Likelihood (MCL) and chose the topology with the best log likelihood value. The tree's branch lengths are measured in substitutions per location. These 51 nucleotides were studied. Gaps and incomplete data were removed from all spots (complete deletion option). The final dataset has 1279 positions. MEGA X was used to study evolution Among 15 BGCs, five BGCs (cluster 3, 5, 8, 10 and 13 regions 1.3, 2.1, 3.1, 3.3 and 3.6) were more than 50% identical to known BGCs. Other BGCs exhibited just a low degree of similarity or resemblance to previously identified BGCs, implying that Burkholderia sp. S-53 has a significant potential for the production of novel NPs in the future.
Moreover, we performed BAGEL analysis on S-53 genome and identified additional 2 different clusters for bacteriocins and RiPPs (Table 7).

Discussion and conclusion
The high potential of Burkholderia to produce bioactive NPs has been reported with an increasing publishing record in decade years. Moreover, the rapid growth rate and low fermentation cost make them as a potential host Table 6 The analysis of biosynthetic pathways in Burkholderia sp. S-53 by antiSMASH 6.0 * NRPS-Non ribosomal peptide synthetase cluster # The "similarity" is the percent of homologous genes in the query and hit clusters. As defined by antiSMASH, the homologous genes were chosen for their high sequence identity (> 30%) and short BLAST alignments (> 25%)   for heterologous expression of some NP BGCs, aided by the establishment of genetic manipulation systems [50]. In this work, genomic investigation of microbes isolated from underexplored mountain habitats found several strains of Burkholderia species, among of which, S-53 attracted us for its comparatively quicker growth rate (16-18 h for entering stationary stage, compared to 24 h for general Burkholderia species), as a critical feature when it could be developed into a host for expressing some NP BGCs later.
We identified it using different methods and analyzed its evolution. Given that modern bacterial taxonomy uses genome sequence data to identify taxa, by means of genome sequence data, identification of a bacterial species is considered always to be more correct and persuasive. Thus, though different molecular methods for bacterial identification gave a little different result, our data deduced it to be Burkholderia pyrrocinia.
Other subspecies of Burkholderia pyrrocinia were ever isolated from different habitats, such as Burkholderia pyrrocinia JK-SH007, a plant growth-promoting bacteria from plat rhizosphere [51]. Burkholderia pyrrocinia, along with Burkholderia cenocepacia and Burkholderia ambifari, was referenced as Burkholderia cepacia complex (BCC) species, which are most frequently associated with roots of crop plants [52].
Taxonomy and identification of species in Burkholderia still are quite challenging. Though a high similarity of 16S rDNA ranging 98-100% often is used as "common standard" for bacterial identification at species level, it could not be applicable to classification of Burkholderia species [53], especially, for classification of BCC group of Burkholderia. So, the whole-genomesequence-based taxonomic analysis could give comparably more reliable results, when combining other molecular methods.
Genomics-based bottom-up techniques have been developed to reveal previously undiscovered natural product biosynthesis pathways [54]. Here, whole genome sequencing and bioinformatic analyses of Burkholderia strain S-53 revealed many secondary metabolite biosynthetic gene clusters. Moreover, bioinformatics analysis uncovered more than two-thirds of BGCs in S-53 are not related to recognized clusters (Table 6).
These data supported that S-53 could be a good candidate used for identifying new NPs. Next, more research is needed to improve, isolate, and identify new bioactive natural products from this strain and to investigate the possibility of it to be as chassis for expressing of new NPs.